Sociology 229:  Advanced Regression Models

 

Short Assignment #2:  Count Models

 

Due:  Start of class (9:00) January 26

 

This assignment requires the dataset “Assignment 2 Count Data.dta”, available on the course website.  It contains information on approximately 2500 people from the GSS, including the variable “memnum”: the sum of a series of dummy variables indicating individual membership in different types of voluntary associations, such as school organizations, religious groups, and sport/hobby associations.  This is not exactly the total number of memberships an individual has, because some people join more than one association of a given type, but it is close, so for the purposes of the assignment you can describe it as total memberships.  Individuals who have many memberships in voluntary organizations are often said to be “civically engaged”, and communities with many memberships are believed to have high levels of social capital.

 

Notes:  Education is measured in years.  TV watching is measured in hours per day (on average).

 

  1. Download the Assignment 2 dataset
  2. Create your own “do” file that opens the data
  3. Use the “des” and “tabulate” commands to examine the variable memnum, which will be the dependent variable of your analyses.
    1. des memnum
    2. tab memnum
    3. You may also wish to use ‘des’ and ‘tab’ to examine other variables.
  4. Run a Poisson regression model looking at the effects of age, gender (male dummy), education, household income, marital status (married dummy), employment status (reference group is not employed), religious participation, and TV viewing habits on membership in voluntary associations.
    1. poisson memnum age dmale educ income married empfull emppart attend tvhours
  5. Run a negative binomial model using the same variables
    1. nbreg memnum age dmale educ income married empfull emppart attend tvhours
    2. Note the hypothesis test regarding the dispersion parameter alpha.  Is there evidence of overdispersion?
  6. Run a negative binomial model, but request “incidence rate ratios” instead of raw coefficients
    1. nbreg memnum age dmale educ income married empfull emppart attend tvhours, irr
  7. Use the adjust (Stata 9/10) or margins (Stata 11) command to generate predicted counts for a hypothetical individual with the following properties:
    1. adjust educ=12 age=20 dmale=1 income=10 married=1 empfull=1 emppart=0 attend=2 tvhours=2, exp
    2. margins , at(educ=12 age=20 dmale=1 income=10 married=1 empfull=1 emppart=0 attend=2 tvhours=2)
    3. Recompute the predicted counts for two additional values of educ: 8 and 16 years of education.
  8. Examine how predicted counts vary across levels of the independent variable tvhours, with the other variables held at their means across all cases.  In Stata 9/10 you need to specify the variable means manually.  Stata 11 automates this with the option “atmeans”:
    1. adjust age=45.24387 dmale=.42599 educ=13.077 income=14.5196 married=.5498 empfull=.521 emppart=.104 attend=3.87, exp by(tvhours)
    2. margins , at(tvhours=(0 1 2 3 4 5 6 7 8)) atmeans
  9. Run a zero-inflated negative binomial model, including the same set of covariates in both the main equation and the inflation equation.
    1. zinb memnum age dmale educ income married empfull emppart attend tvhours, inflate(age dmale educ income married empfull emppart attend tvhours)
  10. Answer questions below.
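
One way to organize the steps above in a do-file is sketched below. This is only an outline under stated assumptions: the log file name assignment2.log is hypothetical, the dataset is assumed to sit in Stata's current working directory, and the margins command assumes Stata 11 or later. The numlist inside at() is one possible shortcut for step 7, producing the predicted counts for educ = 8, 12, and 16 in a single command.

```stata
* Sketch of a do-file for this assignment (file names are assumptions)
capture log close
log using "assignment2.log", replace text   // hypothetical log file name

use "Assignment 2 Count Data.dta", clear    // assumes the file is in the working directory

describe memnum
tabulate memnum

* Negative binomial model (step 5); the Poisson, IRR, and ZINB runs follow the same pattern
nbreg memnum age dmale educ income married empfull emppart attend tvhours

* Step 7: predicted counts at educ = 8, 12, and 16, other values as given in step 7
margins, at(educ=(8 12 16) age=20 dmale=1 income=10 married=1 ///
    empfull=1 emppart=0 attend=2 tvhours=2)

log close
```

The `//` comments and the `///` line continuation work only when the commands are run from a do-file, not when pasted into the command window.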

 

 

Question 1:  Based on the test of the overdispersion parameter alpha, which model is preferred: the Poisson regression or the negative binomial model?  Were the results similar overall?  Did the choice of model affect any conclusions you might draw from the analysis?
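
As a reminder for Question 1, and assuming the default (mean-dispersion) parameterization of nbreg, the two models share the same conditional mean and differ only in the conditional variance:

```latex
\begin{align*}
\mu &= \exp(x\beta) \\
\text{Poisson:} \quad \operatorname{Var}(y \mid x) &= \mu \\
\text{Negative binomial:} \quad \operatorname{Var}(y \mid x) &= \mu + \alpha\mu^{2}
\end{align*}
```

When alpha = 0 the negative binomial model reduces to the Poisson model, which is why the test of alpha = 0 reported at the bottom of the nbreg output bears on the choice between them.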

 

Question 2:  Interpret the coefficients for education (measured in years) and television viewing time (hours per day).  Discuss the raw coefficient, the incidence rate ratio (which is analogous to an odds ratio), and the % difference in incidence rate.  (Note:  since the model involves constant exposure, which is often the case, you can use the word “count” instead of “incidence rate” in describing results.  If exposure varies, you should use the term “rate” rather than “count”.)
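
The three quantities in Question 2 are related as follows for a raw coefficient b. The two numerical cases use made-up coefficient values chosen purely for illustration; they are not results from this dataset.

```latex
\begin{align*}
\mathrm{IRR} &= e^{b} \\
\text{\% change in expected count per one-unit increase} &= 100\,(e^{b} - 1) \\[4pt]
\text{illustrative } b = 0.05: \quad e^{0.05} &\approx 1.051 \;\Rightarrow\; +5.1\% \\
\text{illustrative } b = -0.10: \quad e^{-0.10} &\approx 0.905 \;\Rightarrow\; -9.5\%
\end{align*}
```

An IRR above 1 therefore corresponds to a positive raw coefficient and a higher expected count, and an IRR below 1 to a negative coefficient and a lower expected count.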

 

Question 3:  Discuss the impact of education and television viewing on membership in associations, based on the predicted counts computed above (one set of which is for a hypothetical case, and one for the “average” case).

 

Question 4:  Comment on the results from the zero-inflated negative binomial model.  (Note that the “inflate” equation predicts zeros, so a negative coefficient corresponds to a positive impact on being non-zero.  So, you need to ‘flip’ signs to interpret results consistently with the count model.)  Often the effects are consistent between both equations, but sometimes a variable mainly affects only the count equation or only the inflation equation.  Does this happen here?  Suggest an interpretation.
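
The sign flip in Question 4 can be seen from the structure of the model. The sketch below assumes the default logit link for the inflate equation in zinb; the symbols pi (probability of being an "always zero" case), z and gamma (inflation covariates and coefficients), and x and beta (count covariates and coefficients) are notation introduced here for illustration.

```latex
\begin{align*}
\pi &= \frac{\exp(z\gamma)}{1 + \exp(z\gamma)}, \qquad \mu = \exp(x\beta) \\
\Pr(y = 0 \mid x, z) &= \pi + (1 - \pi)\,(1 + \alpha\mu)^{-1/\alpha} \\
\Pr(y = k \mid x, z) &= (1 - \pi)\,\Pr\nolimits_{\mathrm{NB}}(k \mid \mu, \alpha), \qquad k = 1, 2, \dots
\end{align*}
```

A positive element of gamma raises pi and so pushes cases toward zero memberships, whereas a positive element of beta raises the expected count mu among cases that are not "always zero".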

 

Turn in the following:

  1. Your “do” file, containing all commands you used for this assignment.
  2. The output from steps #5, #6, #8.2 (the margins command), and #9.
  3. Answers to the questions.